SeqProp - Protein Sequence Properties

This notebook gives an overview of the calculations available for the properties of a single protein sequence.

Input: Amino acid sequence

Output: Amino acid sequence properties

Note

See ssbio.protein.sequence.seqprop.SeqProp for a description of all the available attributes and functions.

Imports

In [ ]:

import sys
import logging
import os.path as op

In [ ]:

# Import the SeqProp class
from ssbio.protein.sequence.seqprop import SeqProp

In [ ]:

# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. DEBUG is the most verbose level; CRITICAL is the least.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

In [ ]:

# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [ ]:

# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
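
To see how the chosen level filters messages, here is a minimal, self-contained sketch using only the standard library. The logger name seqprop_demo and the messages are made up for illustration; the sketch writes to an in-memory buffer and does not touch the root logger configured above.

```python
import io
import logging

# Throwaway logger (hypothetical name) so the root logger stays untouched
demo_logger = logging.getLogger("seqprop_demo")
demo_logger.propagate = False

# Capture output in memory instead of stderr so it can be inspected
stream = io.StringIO()
demo_handler = logging.StreamHandler(stream)
demo_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
demo_logger.handlers = [demo_handler]

# At INFO, DEBUG messages are dropped; INFO and above pass through
demo_logger.setLevel(logging.INFO)
demo_logger.debug("hidden at INFO")
demo_logger.info("shown at INFO")
demo_logger.warning("also shown at INFO")

# Lowering the threshold to DEBUG lets everything through
demo_logger.setLevel(logging.DEBUG)
demo_logger.debug("shown at DEBUG")

captured = stream.getvalue().splitlines()
print(captured)
```

Only the first debug call is suppressed, because it was issued while the threshold was still INFO.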

Initialization of the project

Set these two things:

PROTEIN_ID
- Your protein ID
PROTEIN_SEQ
- Your protein sequence

In [ ]:

# SET IDS HERE
PROTEIN_ID = 'YIAJ_ECOLI'
PROTEIN_SEQ = 'MGKEVMGKKENEMAQEKERPAGSQSLFRGLMLIEILSNYPNGCPLAHLSELAGLNKSTVHRLLQGLQSCGYVTTAPAAGSYRLTTKFIAVGQKALSSLNIIHIAAPHLEALNIATGETINFSSREDDHAILIYKLEPTTGMLRTRAYIGQHMPLYCSAMGKIYMAFGHPDYVKSYWESHQHEIQPLTRNTITELPAMFDELAHIRESGAAMDREENELGVSCIAVPVFDIHGRVPYAVSISLSTSRLKQVGEKNLLKPLRETAQAISNELGFTVRDDLGAIT'

In [ ]:

# Create the SeqProp object
my_seq = SeqProp(id=PROTEIN_ID, seq=PROTEIN_SEQ)

SeqProp.write_fasta_file(outfile, force_rerun=False)

Write a FASTA file for the protein sequence. After this call, seq is loaded directly from that file.

Parameters:

outfile (str)
- Path to the new FASTA file to be written
force_rerun (bool)
- Whether an existing file should be overwritten

In [ ]:

# Write temporary FASTA file for property calculations that require FASTA file as input
import tempfile
ROOT_DIR = tempfile.gettempdir()

my_seq.write_fasta_file(outfile=op.join(ROOT_DIR, 'tmp.fasta'), force_rerun=True)
my_seq.sequence_path
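
For readers unfamiliar with the format, the sketch below shows roughly what writing a FASTA file involves: a header line starting with ">" followed by the sequence. The helper write_fasta, its width parameter, and the file name are hypothetical and are not ssbio's implementation; the force_rerun behavior is assumed from the parameter description above.

```python
import os
import tempfile
import textwrap


def write_fasta(seq_id, seq, outfile, force_rerun=False, width=60):
    """Hypothetical helper: write one sequence to a FASTA file.

    An existing file is left alone unless force_rerun is True.
    """
    if os.path.exists(outfile) and not force_rerun:
        return outfile
    with open(outfile, "w") as f:
        f.write(">{}\n".format(seq_id))
        # Wrap the sequence into fixed-width lines, a common FASTA layout
        f.write("\n".join(textwrap.wrap(seq, width)) + "\n")
    return outfile


# First 30 residues of the example sequence, under a made-up ID
demo_path = os.path.join(tempfile.gettempdir(), "seqprop_demo.fasta")
write_fasta("DEMO_ID", "MGKEVMGKKENEMAQEKERPAGSQSLFRGL", demo_path, force_rerun=True)

with open(demo_path) as f:
    fasta_text = f.read()
print(fasta_text)
```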

Computing and storing protein properties

A SeqProp object is simply an extension of the Biopython SeqRecord object. Global properties which describe or summarize the entire protein sequence are stored in the annotations attribute, while local residue-specific properties are stored in the letter_annotations attribute.
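
The difference between the two kinds of property can be illustrated without any library. The sketch below uses plain dictionaries as stand-ins for annotations and letter_annotations; the key names are hypothetical, and the hydropathy values are from the Kyte-Doolittle scale. In Biopython itself, each letter_annotations entry must have the same length as the sequence.

```python
# Kyte-Doolittle hydropathy values for the residues used below
KD = {"M": 1.9, "G": -0.4, "K": -3.9, "E": -3.5, "V": 4.2}

seq = "MGKEV"  # first five residues of the example sequence

# Local property: one value per residue, same length as the sequence
letter_annotations = {"hydropathy": [KD[aa] for aa in seq]}

# Global properties: one value summarizing the whole sequence
annotations = {
    "seq_length": len(seq),
    "mean_hydropathy": sum(letter_annotations["hydropathy"]) / len(seq),
}

print(annotations)
print(letter_annotations)
```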

Basic global properties

SeqProp.get_biopython_pepstats()

Run Biopython’s built-in ProteinAnalysis module and store statistics in the annotations attribute.

In [ ]:

# Global properties using the Biopython ProteinAnalysis module
my_seq.get_biopython_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-biop')}

SeqProp.get_emboss_pepstats()

Run the EMBOSS pepstats program on the protein sequence.

Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.

In [ ]:

# Global properties from the EMBOSS pepstats program
my_seq.get_emboss_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-pepstats')}
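
Both cells above select results by the suffix of the annotation key, which marks the tool that produced the value. As a sketch of that pattern, the helper below groups a dictionary by everything after the last hyphen in each key. The function group_by_suffix and all keys and values shown are placeholders for illustration, not actual ssbio output.

```python
from collections import defaultdict

# Hypothetical annotation keys with placeholder values
example_annotations = {
    "example_a-biop": 1.0,
    "example_b-biop": 2.0,
    "example_c-pepstats": 3.0,
    "no_suffix_key": 4.0,
}


def group_by_suffix(annots):
    """Group entries by the text after the last hyphen in each key."""
    grouped = defaultdict(dict)
    for key, value in annots.items():
        # rpartition splits on the LAST hyphen; keys without one go under ""
        _, sep, suffix = key.rpartition("-")
        grouped[suffix if sep else ""][key] = value
    return dict(grouped)


grouped = group_by_suffix(example_annotations)
print(sorted(grouped))
```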

SeqProp.get_aggregation_propensity(email, password, cutoff_v=5, cutoff_n=5, run_amylmuts=False, outdir=None)

Run the AMYLPRED2 web server to calculate the aggregation propensity of this protein sequence, which is the number of aggregation-prone segments on the unfolded protein sequence.

Stores statistics in the annotations attribute, under the key aggprop-amylpred.

See ssbio.protein.sequence.properties.aggregation_propensity for instructions and details.

In [ ]:

# Aggregation propensity - the predicted number of aggregation-prone segments on an unfolded protein sequence
my_seq.get_aggregation_propensity(outdir=ROOT_DIR, email='nmih@ucsd.edu', password='ssbiotest', cutoff_v=5, cutoff_n=5, run_amylmuts=False)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-amylpred')}

SeqProp.get_kinetic_folding_rate(secstruct, at_temp=None)

Run the FOLD-RATE web server to calculate the kinetic folding rate given an amino acid sequence and its structural classification (alpha/beta/mixed).

Stores statistics in the annotations attribute, under the key kinetic_folding_rate_<TEMP>-foldrate.

See ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate() for instructions and details.

In [ ]:

# Kinetic folding rate - the predicted rate of folding for this protein sequence
secstruct_class = 'mixed'
my_seq.get_kinetic_folding_rate(secstruct=secstruct_class)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-foldrate')}

SeqProp.get_thermostability(at_temp)

Run the thermostability calculator using either the Dill or Oobatake methods.

Stores calculated (dG, Keq) tuple in the annotations attribute, under the key thermostability_<TEMP>-<METHOD_USED>.

See ssbio.protein.sequence.properties.thermostability.get_dG_at_T() for instructions and details.

In [ ]:

# Thermostability - prediction of free energy of unfolding dG from protein sequence
# Stores (dG, Keq)
my_seq.get_thermostability(at_temp=32.0)
my_seq.get_thermostability(at_temp=37.0)
my_seq.get_thermostability(at_temp=42.0)
{k:v for k,v in my_seq.annotations.items() if k.startswith('thermostability_')}
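
The two members of each stored tuple are related by standard thermodynamics. The sketch below assumes that dG is a free energy of unfolding in cal/mol, that temperatures are in degrees Celsius, and that the equilibrium constant follows K = exp(-dG / RT). These units and the sign convention are assumptions; check them against the ssbio thermostability documentation before comparing numbers. The function keq_from_dg is a hypothetical helper, not part of ssbio.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol*K)


def keq_from_dg(dg_cal_per_mol, temp_celsius):
    """Equilibrium constant from a free energy, K = exp(-dG / RT).

    Assumes dG in cal/mol and temperature in degrees Celsius.
    """
    temp_kelvin = temp_celsius + 273.15
    return math.exp(-dg_cal_per_mol / (R_CAL * temp_kelvin))


# Placeholder dG values (not computed for any real protein)
for dg in (0.0, 1000.0, 5000.0):
    print(dg, keq_from_dg(dg, 37.0))
```

Under this convention, a larger positive dG of unfolding gives a smaller Keq, meaning the folded state is more strongly favored.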